Milestone 1
Context: The used car market is outpacing the new car market in sales. How do we determine how to price used cars for maximum profit?
Objective: Our goal is to design a pricing model that can effectively predict the price of used cars and help the business in devising profitable strategies using differential pricing.
Key Questions: What key features determine car value? How can we use regression to determine maximum profitability?
S.No. : Serial Number
Name : Name of the car which includes Brand name and Model name
Location : The location in which the car is being sold or is available for purchase (Cities)
Year : Manufacturing year of the car
Kilometers_Driven : Total kilometers driven by the previous owner(s)
Fuel_Type : The type of fuel used by the car. (Petrol, Diesel, Electric, CNG, LPG)
Transmission : The type of transmission used by the car. (Automatic / Manual)
Owner : Type of ownership
Mileage : The standard mileage offered by the car company in kmpl or km/kg
Engine : The displacement volume of the engine in CC.
Power : The maximum power of the engine in bhp.
Seats : The number of seats in the car.
New_Price : The price of a new car of the same model in INR 100,000
Price : The price of the used car in INR 100,000 (Target Variable)
Library Import
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
#remove limit from the number of displayed columns and rows.
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 200)
Load Data
from google.colab import drive
drive.mount('/content/drive')
data = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/used_cars_a.csv')
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Exploratory Data Analysis
data.head(10)
|   | S.No. | Name | Location | Year | Kilometers_Driven | Fuel_Type | Transmission | Owner_Type | Mileage | Engine | Power | Seats | New_price | Price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | Maruti Wagon R LXI CNG | Mumbai | 2010 | 72000 | CNG | Manual | First | 26.60 | 998.0 | 58.16 | 5.0 | NaN | 1.75 |
| 1 | 1 | Hyundai Creta 1.6 CRDi SX Option | Pune | 2015 | 41000 | Diesel | Manual | First | 19.67 | 1582.0 | 126.20 | 5.0 | NaN | 12.50 |
| 2 | 2 | Honda Jazz V | Chennai | 2011 | 46000 | Petrol | Manual | First | 18.20 | 1199.0 | 88.70 | 5.0 | 8.61 | 4.50 |
| 3 | 3 | Maruti Ertiga VDI | Chennai | 2012 | 87000 | Diesel | Manual | First | 20.77 | 1248.0 | 88.76 | 7.0 | NaN | 6.00 |
| 4 | 4 | Audi A4 New 2.0 TDI Multitronic | Coimbatore | 2013 | 40670 | Diesel | Automatic | Second | 15.20 | 1968.0 | 140.80 | 5.0 | NaN | 17.74 |
| 5 | 5 | Hyundai EON LPG Era Plus Option | Hyderabad | 2012 | 75000 | LPG | Manual | First | 21.10 | 814.0 | 55.20 | 5.0 | NaN | 2.35 |
| 6 | 6 | Nissan Micra Diesel XV | Jaipur | 2013 | 86999 | Diesel | Manual | First | 23.08 | 1461.0 | 63.10 | 5.0 | NaN | 3.50 |
| 7 | 7 | Toyota Innova Crysta 2.8 GX AT 8S | Mumbai | 2016 | 36000 | Diesel | Automatic | First | 11.36 | 2755.0 | 171.50 | 8.0 | 21.00 | 17.50 |
| 8 | 8 | Volkswagen Vento Diesel Comfortline | Pune | 2013 | 64430 | Diesel | Manual | First | 20.54 | 1598.0 | 103.60 | 5.0 | NaN | 5.20 |
| 9 | 9 | Tata Indica Vista Quadrajet LS | Chennai | 2012 | 65932 | Diesel | Manual | Second | 22.30 | 1248.0 | 74.00 | 5.0 | NaN | 1.95 |
data.tail(10)
|   | S.No. | Name | Location | Year | Kilometers_Driven | Fuel_Type | Transmission | Owner_Type | Mileage | Engine | Power | Seats | New_price | Price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 7243 | 7243 | Renault Duster 85PS Diesel RxL | Chennai | 2015 | 70000 | Diesel | Manual | First | 19.87 | 1461.0 | 83.8 | 5.0 | NaN | NaN |
| 7244 | 7244 | Chevrolet Aveo 1.4 LS | Pune | 2009 | 45463 | Petrol | Manual | First | 14.49 | 1399.0 | 92.7 | 5.0 | NaN | NaN |
| 7245 | 7245 | Honda Amaze S i-Vtech | Kochi | 2015 | 44776 | Petrol | Manual | First | 18.00 | 1198.0 | 86.7 | 5.0 | NaN | NaN |
| 7246 | 7246 | Hyundai Grand i10 AT Asta | Coimbatore | 2016 | 18242 | Petrol | Automatic | First | 18.90 | 1197.0 | 82.0 | 5.0 | NaN | NaN |
| 7247 | 7247 | Hyundai EON D Lite Plus | Coimbatore | 2015 | 21190 | Petrol | Manual | First | 21.10 | 814.0 | 55.2 | 5.0 | NaN | NaN |
| 7248 | 7248 | Volkswagen Vento Diesel Trendline | Hyderabad | 2011 | 89411 | Diesel | Manual | First | 20.54 | 1598.0 | 103.6 | 5.0 | NaN | NaN |
| 7249 | 7249 | Volkswagen Polo GT TSI | Mumbai | 2015 | 59000 | Petrol | Automatic | First | 17.21 | 1197.0 | 103.6 | 5.0 | NaN | NaN |
| 7250 | 7250 | Nissan Micra Diesel XV | Kolkata | 2012 | 28000 | Diesel | Manual | First | 23.08 | 1461.0 | 63.1 | 5.0 | NaN | NaN |
| 7251 | 7251 | Volkswagen Polo GT TSI | Pune | 2013 | 52262 | Petrol | Automatic | Third | 17.20 | 1197.0 | 103.6 | 5.0 | NaN | NaN |
| 7252 | 7252 | Mercedes-Benz E-Class 2009-2013 E 220 CDI Avan... | Kochi | 2014 | 72443 | Diesel | Automatic | First | 10.00 | 2148.0 | 170.0 | 5.0 | NaN | NaN |
print(data.shape)
(7253, 14)
data.dtypes
S.No.                  int64
Name                  object
Location              object
Year                   int64
Kilometers_Driven      int64
Fuel_Type             object
Transmission          object
Owner_Type            object
Mileage              float64
Engine               float64
Power                float64
Seats                float64
New_price            float64
Price                float64
dtype: object
data.describe().T
|   | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| S.No. | 7253.0 | 3626.000000 | 2093.905084 | 0.00 | 1813.000 | 3626.00 | 5439.0000 | 7252.00 |
| Year | 7253.0 | 2013.365366 | 3.254421 | 1996.00 | 2011.000 | 2014.00 | 2016.0000 | 2019.00 |
| Kilometers_Driven | 7253.0 | 58699.063146 | 84427.720583 | 171.00 | 34000.000 | 53416.00 | 73000.0000 | 6500000.00 |
| Mileage | 7170.0 | 18.346527 | 4.157910 | 6.40 | 15.300 | 18.20 | 21.1000 | 33.54 |
| Engine | 7207.0 | 1616.573470 | 595.285137 | 72.00 | 1198.000 | 1493.00 | 1968.0000 | 5998.00 |
| Power | 7078.0 | 112.765214 | 53.493553 | 34.20 | 75.000 | 94.00 | 138.1000 | 616.00 |
| Seats | 7200.0 | 5.280417 | 0.809277 | 2.00 | 5.000 | 5.00 | 5.0000 | 10.00 |
| New_price | 1006.0 | 22.779692 | 27.759344 | 3.91 | 7.885 | 11.57 | 26.0425 | 375.00 |
| Price | 6019.0 | 9.479468 | 11.187917 | 0.44 | 3.500 | 5.64 | 9.9500 | 160.00 |
Year: ranges from 1996 to 2019.
Kilometers_Driven: the very high std suggests outliers; the max (6,500,000 km) is clearly an outlier.
Mileage: average is ~18.
Engine: may have outliers that skew the distribution.
Power: the max (616 bhp) is high compared to the 75th percentile (138 bhp).
Seats: most cars have 5; range is 2 to 10.
New_price: ~86% of values are missing; a substantial imputation strategy is needed.
Price: ranges from 0.44 to 160; the median is 5.64.
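The outlier remarks above can be made concrete with the standard 1.5×IQR boxplot rule. This is a minimal sketch on made-up distances; `iqr_outlier_bounds` is a hypothetical helper, not part of the notebook.

```python
import pandas as pd

def iqr_outlier_bounds(s):
    """Return the (lower, upper) whisker bounds used by a boxplot."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Toy distances with one extreme value, mimicking Kilometers_Driven.
km = pd.Series([34000, 53416, 73000, 41000, 6500000])
lo, hi = iqr_outlier_bounds(km)
outliers = km[(km < lo) | (km > hi)]
print(outliers.tolist())  # [6500000] — only the extreme reading is flagged
```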
data.nunique()
S.No.                7253
Name                 2041
Location               11
Year                   23
Kilometers_Driven    3660
Fuel_Type               5
Transmission            2
Owner_Type              4
Mileage               437
Engine                150
Power                 383
Seats                   8
New_price             625
Price                1373
dtype: int64
S.No. has 7,253 unique values (one per row), so it will not add to the analysis and can be dropped.
# Dropping the S.No. column
data.drop(['S.No.'], axis=1, inplace=True)
print(data.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7253 entries, 0 to 7252
Data columns (total 13 columns):
 #   Column             Non-Null Count  Dtype
---  ------             --------------  -----
 0   Name               7253 non-null   object
 1   Location           7253 non-null   object
 2   Year               7253 non-null   int64
 3   Kilometers_Driven  7253 non-null   int64
 4   Fuel_Type          7253 non-null   object
 5   Transmission       7253 non-null   object
 6   Owner_Type         7253 non-null   object
 7   Mileage            7170 non-null   float64
 8   Engine             7207 non-null   float64
 9   Power              7078 non-null   float64
 10  Seats              7200 non-null   float64
 11  New_price          1006 non-null   float64
 12  Price              6019 non-null   float64
dtypes: float64(6), int64(2), object(5)
memory usage: 736.8+ KB
None
Columns with missing values: Mileage, Engine, Power, Seats, New_price, Price.
# Checking the percentage of missing values in each column.
(data.isnull().sum()/data.shape[0])*100
Name                  0.000000
Location              0.000000
Year                  0.000000
Kilometers_Driven     0.000000
Fuel_Type             0.000000
Transmission          0.000000
Owner_Type            0.000000
Mileage               1.144354
Engine                0.634220
Power                 2.412795
Seats                 0.730732
New_price            86.129877
Price                17.013650
dtype: float64
There are missing values across the abovementioned columns. To begin the imputation process, we will create new Brand and Model columns by splitting the Name column. This feature engineering lets us impute values more accurately via groupby.
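The group-wise fill pattern used throughout the imputation below can be illustrated on a hypothetical mini-frame: a missing Mileage is replaced by the median of the same Model, so a Swift with NaN mileage inherits the typical Swift figure rather than the global median.

```python
import numpy as np
import pandas as pd

# Illustrative data only, not rows from the dataset.
df = pd.DataFrame({
    "Model": ["Swift", "Swift", "Swift", "City", "City"],
    "Mileage": [20.0, 22.0, np.nan, 17.0, np.nan],
})

# transform() keeps the original index, so the filled column aligns row-for-row.
df["Mileage"] = df.groupby("Model")["Mileage"].transform(
    lambda x: x.fillna(x.median())
)
print(df["Mileage"].tolist())  # [20.0, 22.0, 21.0, 17.0, 17.0]
```

Groups whose members are all NaN stay NaN, which is why the notebook falls back from Model to Brand and finally to the overall median.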
# Splitting the Name column into 'Brand' and 'Model' columns.
data[['Brand', 'Model']] = data['Name'].str.split(" ", n=1, expand=True)
print(data.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7253 entries, 0 to 7252
Data columns (total 15 columns):
 #   Column             Non-Null Count  Dtype
---  ------             --------------  -----
 0   Name               7253 non-null   object
 1   Location           7253 non-null   object
 2   Year               7253 non-null   int64
 3   Kilometers_Driven  7253 non-null   int64
 4   Fuel_Type          7253 non-null   object
 5   Transmission       7253 non-null   object
 6   Owner_Type         7253 non-null   object
 7   Mileage            7170 non-null   float64
 8   Engine             7207 non-null   float64
 9   Power              7078 non-null   float64
 10  Seats              7200 non-null   float64
 11  New_price          1006 non-null   float64
 12  Price              6019 non-null   float64
 13  Brand              7253 non-null   object
 14  Model              7253 non-null   object
dtypes: float64(6), int64(2), object(7)
memory usage: 850.1+ KB
None
# Checking missing values in the column 'Mileage'
data["Mileage"].isnull().sum()
83
data["Mileage"] = data.groupby(["Model"])["Mileage"].transform(
lambda x: x.fillna(x.median())
)
data["Mileage"].isnull().sum()
80
data["Mileage"] = data.groupby(["Brand"])["Mileage"].transform(
lambda x: x.fillna(x.median())
)
data["Mileage"].isnull().sum()
1
data['Mileage'] = data['Mileage'].fillna(data['Mileage'].median())
data["Mileage"].isnull().sum()
0
(data.isnull().sum()/data.shape[0])*100
Name                  0.000000
Location              0.000000
Year                  0.000000
Kilometers_Driven     0.000000
Fuel_Type             0.000000
Transmission          0.000000
Owner_Type            0.000000
Mileage               0.000000
Engine                0.634220
Power                 2.412795
Seats                 0.730732
New_price            86.129877
Price                17.013650
Brand                 0.000000
Model                 0.000000
dtype: float64
# Checking missing values in the column 'Engine'
data["Engine"].isnull().sum()
46
data["Engine"] = data.groupby(["Brand"])["Engine"].transform(
lambda x: x.fillna(x.median())
)
data["Engine"].isnull().sum()
0
# Checking missing values in the column 'Power'
data["Power"].isnull().sum()
175
data["Power"] = data.groupby(["Model"])["Power"].transform(
lambda x: x.fillna(x.median())
)
data["Power"].isnull().sum()
136
data["Power"] = data.groupby(["Brand"])["Power"].transform(
lambda x: x.fillna(x.median())
)
data["Power"].isnull().sum()
2
data['Power'] = data['Power'].fillna(data['Power'].median())
data["Power"].isnull().sum()
0
data["Seats"].isnull().sum()
53
data['Seats'] = data['Seats'].fillna(data['Seats'].mean())
data["Seats"].isnull().sum()
0
data["New_price"].isnull().sum()
6247
data["New_price"] = data.groupby(["Name"])["New_price"].transform(
lambda x: x.fillna(x.median())
)
data["New_price"].isnull().sum()
6020
data["New_price"] = data.groupby(["Model"])["New_price"].transform(
lambda x: x.fillna(x.median())
)
data["New_price"].isnull().sum()
6020
data["New_price"] = data.groupby(["Brand"])["New_price"].transform(
lambda x: x.fillna(x.median())
)
data["New_price"].isnull().sum()
162
data['New_price'] = data['New_price'].fillna(data['New_price'].median())
data["New_price"].isnull().sum()
0
data["Price"].isnull().sum()
1234
data["Price"] = data.groupby(["Name"])["Price"].transform(
lambda x: x.fillna(x.median())
)
data["Price"].isnull().sum()
178
data["Price"] = data.groupby(["Brand"])["Price"].transform(
lambda x: x.fillna(x.median())
)
data["Price"].isnull().sum()
2
data['Price'] = data['Price'].fillna(data['Price'].median())
data["Price"].isnull().sum()
0
No further missing values found.
data.describe().T
|   | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Year | 7253.0 | 2013.365366 | 3.254421 | 1996.00 | 2011.00 | 2014.00 | 2016.00 | 2019.00 |
| Kilometers_Driven | 7253.0 | 58699.063146 | 84427.720583 | 171.00 | 34000.00 | 53416.00 | 73000.00 | 6500000.00 |
| Mileage | 7253.0 | 18.327124 | 4.147499 | 6.40 | 15.29 | 18.20 | 21.10 | 33.54 |
| Engine | 7253.0 | 1616.119812 | 594.431397 | 72.00 | 1198.00 | 1493.00 | 1968.00 | 5998.00 |
| Power | 7253.0 | 112.264446 | 53.257490 | 34.20 | 75.00 | 92.70 | 138.03 | 616.00 |
| Seats | 7253.0 | 5.280417 | 0.806315 | 2.00 | 5.00 | 5.00 | 5.00 | 10.00 |
| New_price | 7253.0 | 20.073001 | 22.611560 | 3.91 | 9.51 | 11.15 | 15.38 | 375.00 |
| Price | 7253.0 | 9.348452 | 10.876285 | 0.44 | 3.50 | 5.50 | 9.80 | 160.00 |
Categorical Data Analysis
#explore summary of categorical (object) values
data.describe(include=object)
|   | Name | Location | Fuel_Type | Transmission | Owner_Type | Brand | Model |
|---|---|---|---|---|---|---|---|
| count | 7253 | 7253 | 7253 | 7253 | 7253 | 7253 | 7253 |
| unique | 2041 | 11 | 5 | 2 | 4 | 33 | 2041 |
| top | Mahindra XUV500 W8 2WD | Mumbai | Diesel | Manual | First | Maruti | XUV500 W8 2WD |
| freq | 55 | 949 | 3852 | 5204 | 5952 | 1444 | 55 |
# Summary of the most common car Brands (% of listings)
(data.Brand.value_counts() / len(data))*100
Maruti           19.909003
Hyundai          18.475114
Honda            10.244037
Toyota            6.990211
Mercedes-Benz     5.239211
Volkswagen        5.156487
Ford              4.839377
Mahindra          4.563629
BMW               4.301668
Audi              3.929409
Tata              3.143527
Skoda             2.785054
Renault           2.343858
Chevrolet         2.081897
Nissan            1.613126
Land              0.923756
Jaguar            0.661795
Fiat              0.523921
Mitsubishi        0.496346
Mini              0.427409
Volvo             0.386047
Porsche           0.261961
Jeep              0.261961
Datsun            0.234386
ISUZU             0.041362
Force             0.041362
Isuzu             0.027575
Bentley           0.027575
Smart             0.013787
Ambassador        0.013787
Lamborghini       0.013787
Hindustan         0.013787
OpelCorsa         0.013787
Name: Brand, dtype: float64
# Summary of the most common car Models (% of listings)
(data.Model.value_counts() / len(data))*100
XUV500 W8 2WD 0.758307
Swift VDI 0.675583
Swift Dzire VDI 0.579071
City 1.5 S MT 0.537709
Swift VDI BSIV 0.510134
...
Beat LT Option 0.013787
Rapid 1.6 MPI AT Elegance Plus 0.013787
EcoSport 1.5 TDCi Ambiente 0.013787
i10 Magna 1.1 iTech SE 0.013787
Elite i20 Magna Plus 0.013787
Name: Model, Length: 2041, dtype: float64
# Summary of the most common car Names (% of listings)
(data.Name.value_counts() / len(data))*100
Mahindra XUV500 W8 2WD 0.758307
Maruti Swift VDI 0.675583
Maruti Swift Dzire VDI 0.579071
Honda City 1.5 S MT 0.537709
Maruti Swift VDI BSIV 0.510134
...
Chevrolet Beat LT Option 0.013787
Skoda Rapid 1.6 MPI AT Elegance Plus 0.013787
Ford EcoSport 1.5 TDCi Ambiente 0.013787
Hyundai i10 Magna 1.1 iTech SE 0.013787
Hyundai Elite i20 Magna Plus 0.013787
Name: Name, Length: 2041, dtype: float64
# Summary of the most common car Locations (% of listings)
(data.Location.value_counts() / len(data))*100
Mumbai        13.084241
Hyderabad     12.077761
Coimbatore    10.643872
Kochi         10.643872
Pune          10.547360
Delhi          9.099683
Kolkata        9.016958
Chennai        8.148352
Jaipur         6.879912
Bangalore      6.066455
Ahmedabad      3.791535
Name: Location, dtype: float64
Univariate Analysis - Numerical Data
# Let us write a function that will help us create a boxplot and histogram for any input numerical variable.
# This function takes the numerical column as the input and returns the boxplots and histograms for the variable.
def histogram_boxplot(feature, figsize=(15, 10), bins=None):
    """Boxplot and histogram combined
    feature: 1-d feature array
    figsize: size of fig (default (15, 10))
    bins: number of bins (default None / auto)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,        # 2 rows: boxplot on top, histogram below
        sharex=True,    # x-axis shared between the subplots
        gridspec_kw={"height_ratios": (.25, .75)},
        figsize=figsize,
    )  # creating the 2 subplots
    sns.boxplot(feature, ax=ax_box2, showmeans=True, color='violet')  # a star indicates the mean value of the column
    sns.distplot(feature, kde=False, ax=ax_hist2, bins=bins)          # histogram; bins=None lets seaborn choose
    ax_hist2.axvline(np.mean(feature), color='green', linestyle='--')  # add mean to the histogram
    ax_hist2.axvline(np.median(feature), color='black', linestyle='-')  # add median to the histogram
Year int64 - numerically typed, but treated as an ordinal/categorical feature
Kilometers_Driven int64
Mileage float64
Engine float64
Power float64
Seats float64
New_price float64
Price float64
#Plot histogram and box-plot for 'Kilometers_Driven'
histogram_boxplot(data['Kilometers_Driven'])
data.sort_values(["Kilometers_Driven"], ascending = False).head(5)
|   | Name | Location | Year | Kilometers_Driven | Fuel_Type | Transmission | Owner_Type | Mileage | Engine | Power | Seats | New_price | Price | Brand | Model |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2328 | BMW X5 xDrive 30d M Sport | Chennai | 2017 | 6500000 | Diesel | Automatic | First | 15.97 | 2993.0 | 258.00 | 5.0 | 54.43 | 65.0 | BMW | X5 xDrive 30d M Sport |
| 340 | Skoda Octavia Ambition Plus 2.0 TDI AT | Kolkata | 2013 | 775000 | Diesel | Automatic | First | 19.30 | 1968.0 | 141.00 | 5.0 | 15.38 | 7.5 | Skoda | Octavia Ambition Plus 2.0 TDI AT |
| 1860 | Volkswagen Vento Diesel Highline | Chennai | 2013 | 720000 | Diesel | Manual | First | 20.54 | 1598.0 | 103.60 | 5.0 | 11.26 | 5.9 | Volkswagen | Vento Diesel Highline |
| 358 | Hyundai i10 Magna 1.2 | Chennai | 2009 | 620000 | Petrol | Manual | First | 20.36 | 1197.0 | 78.90 | 5.0 | 9.78 | 2.7 | Hyundai | i10 Magna 1.2 |
| 2823 | Volkswagen Jetta 2013-2015 2.0L TDI Highline AT | Chennai | 2015 | 480000 | Diesel | Automatic | First | 16.96 | 1968.0 | 138.03 | 5.0 | 11.26 | 13.0 | Volkswagen | Jetta 2013-2015 2.0L TDI Highline AT |
# Dropping the extreme 6,500,000 km outlier; an alternative would be to impute the median value of cars in a similar category.
data.drop(index=2328, inplace=True)
#Re-plot histogram and box-plot for 'Kilometers_Driven'
histogram_boxplot(data['Kilometers_Driven'])
### We can add a transformed kilometers_driven feature in data
data["kilometers_driven_log"] = np.log(data["Kilometers_Driven"])
#Log transformation of the feature 'Kilometers_Driven'
sns.distplot(np.log(data["Kilometers_Driven"]), axlabel="Log(Kilometers_Driven)");
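The motivation for the log transform can be checked numerically: the skewness of a long-tailed sample drops sharply after `np.log`. A sketch on illustrative values (not rows from the dataset):

```python
import numpy as np
import pandas as pd

# A small right-skewed sample standing in for Kilometers_Driven.
km = pd.Series([171.0, 34000.0, 53416.0, 73000.0, 775000.0])

# Sample skewness before and after the log transform.
print(round(km.skew(), 2), round(np.log(km).skew(), 2))
```

A value near 0 indicates a roughly symmetric distribution, which is friendlier to linear models than the raw heavy tail.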
#Plot histogram and box-plot for 'Mileage'
histogram_boxplot(data['Mileage'])
Mileage is roughly normally distributed, with two prominent peaks (a bimodal tendency).
#Plot histogram and box-plot for 'Engine'
histogram_boxplot(data['Engine'])
#Plot histogram and box-plot for 'Power'
histogram_boxplot(data['Power'])
# Log transformation of the feature 'Power'
sns.distplot(np.log(data["Power"]), axlabel="Log(Power)");
### We can add a transformed Power feature in data
data["power_log"] = np.log(data["Power"])
histogram_boxplot(data['New_price'])
histogram_boxplot(data['Price'])
# Log transformation of the feature 'Price'
sns.distplot(np.log(data["Price"]), axlabel="Log(Price)");
# We can add a transformed Price feature in data
data["price_log"] = np.log(data["Price"])
data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 7252 entries, 0 to 7252
Data columns (total 18 columns):
 #   Column                 Non-Null Count  Dtype
---  ------                 --------------  -----
 0   Name                   7252 non-null   object
 1   Location               7252 non-null   object
 2   Year                   7252 non-null   int64
 3   Kilometers_Driven      7252 non-null   int64
 4   Fuel_Type              7252 non-null   object
 5   Transmission           7252 non-null   object
 6   Owner_Type             7252 non-null   object
 7   Mileage                7252 non-null   float64
 8   Engine                 7252 non-null   float64
 9   Power                  7252 non-null   float64
 10  Seats                  7252 non-null   float64
 11  New_price              7252 non-null   float64
 12  Price                  7252 non-null   float64
 13  Brand                  7252 non-null   object
 14  Model                  7252 non-null   object
 15  kilometers_driven_log  7252 non-null   float64
 16  power_log              7252 non-null   float64
 17  price_log              7252 non-null   float64
dtypes: float64(9), int64(2), object(7)
memory usage: 1.1+ MB
Univariate Analysis - Categorical Data
# Let us write a function that will help us create barplots that indicate the percentage for each category.
# This function takes the categorical column as the input and returns the barplots for the variable.
def perc_on_bar(z):
    '''
    plot
    feature: categorical feature
    the function won't work if a column is passed in hue parameter
    '''
    total = len(data[z])  # length of the column
    plt.figure(figsize=(15, 5))
    ax = sns.countplot(data[z], palette='Paired', order=data[z].value_counts().index)
    for p in ax.patches:
        percentage = '{:.1f}%'.format(100 * p.get_height() / total)  # percentage of each class of the category
        x = p.get_x() + p.get_width() / 2 - 0.05  # x position of the label
        y = p.get_y() + p.get_height()            # y position of the label
        ax.annotate(percentage, (x, y), size=12)  # annotate the percentage
    plt.show()  # show the plot
Name object
Location object
Fuel_Type object
Transmission object
Owner_Type object
#Bar Plot for 'Brand'
perc_on_bar('Brand')
#Bar Plot for 'Location'
perc_on_bar('Location')
#Bar Plot for 'Fuel_Type'
perc_on_bar('Fuel_Type')
#Bar Plot for 'Transmission'
perc_on_bar('Transmission')
#Bar Plot for 'Owner_Type'
perc_on_bar('Owner_Type')
The two top brands in the Indian market are Maruti and Hyundai.
Location: Mumbai, Hyderabad, and Coimbatore have the most listings.
Most cars run on Diesel or Petrol.
Most cars have manual transmission.
Most cars are first-owner.
Bivariate Analysis - Scatter Plot
#Scatter plot of 'Year' vs 'Price'
data.plot(x='Year', y='Price', style='o');
#Scatter plot of 'Mileage' vs 'Price'
data.plot(x='Mileage', y='Price', style='o');
#Scatter plot of 'Engine' vs 'Price'
data.plot(x='Engine', y='Price', style='o');
#Scatter plot of 'Kilometers_Driven' vs 'Price'
data.plot(x='Kilometers_Driven', y='Price', style='o');
#Scatter plot of 'Power' vs 'Price'
data.plot(x='Power', y='Price', style='o');
#Scatter plot of 'Seats' vs 'Price'
data.plot(x='Seats', y='Price', style='o');
sns.pairplot(data, diag_kind='kde');  # pairplot creates its own figure; a separate plt.figure() call only produces an empty figure
sns.lmplot(x='Year', y='Price', hue='Location',
           data=data.loc[data['Location'].isin(['Mumbai','Hyderabad','Coimbatore'])],
           fit_reg=False);
Bivariate Analysis - Heat Map
#We can include the log transformation values and drop the original skewed data columns
plt.figure(figsize=(12, 7))
sns.heatmap(data.corr(), annot = True, vmin = -1, vmax = 1)
plt.show()
Power, Engine, and Price have strong correlations. As seen in the pairplot, Engine and Power are positively correlated with each other, and each correlates with Price and New_price at roughly 0.66-0.77.
Engine and Mileage have a negative relationship: cars with larger engines tend to deliver lower fuel efficiency (recall that Mileage here is kmpl, not distance driven).
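To rank features by their linear association with the target without reading the whole heatmap, the Price column of the correlation matrix can be pulled out directly. A hedged sketch on a hypothetical mini-frame (`toy` is illustrative; with the real frame this would be `data.corr()["Price"]`):

```python
import pandas as pd

# Illustrative numbers echoing the pattern seen in the data:
# Power rises with Price, Mileage falls with it.
toy = pd.DataFrame({
    "Power":   [58, 126, 88, 171, 140],
    "Mileage": [26.6, 19.7, 18.2, 11.4, 15.2],
    "Price":   [1.75, 12.5, 4.5, 17.5, 17.7],
})

# Correlation of every feature with Price, strongest first.
corr_with_price = toy.corr()["Price"].drop("Price").sort_values(ascending=False)
print(corr_with_price.index.tolist())  # ['Power', 'Mileage']
```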
Bivariate Analysis - Box Plot
# Let us write a function that will help us create boxplot w.r.t Price for any input categorical variable.
# This function takes the categorical column as the input and returns the boxplots for the variable.
def boxplot(z):
    plt.figure(figsize=(12, 5))  # setting size of boxplot
    sns.boxplot(x=z, y=data['Price'])  # defining x and y
    plt.show()

    plt.figure(figsize=(12, 5))
    plt.title('Without Outliers')
    sns.boxplot(x=z, y=data['Price'], showfliers=False)  # hiding the outliers
    plt.show()
boxplot(data['Location'])
boxplot(data['Fuel_Type'])
boxplot(data['Transmission'])
boxplot(data['Owner_Type'])
boxplot(data['Brand'])
We will create a Linear Regression model to see how well we can predict price.
We will also fit Ridge and Lasso regression models,
a Decision Tree model,
and a Random Forest model.
Using the abovementioned techniques, we will split the data into train and test sets to check how accurately each model predicts price.
To compare the models we will review their R-squared results. R-squared measures the strength of the relationship between the model and the dependent variable on a convenient 0-100% scale. The model with the best R-squared on the test set is the one we will use as our algorithm for future predictions.
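The evaluation loop described above can be sketched end to end: fit a linear model on a train split and score R-squared on the held-out test split. The data here is synthetic, standing in for the prepared used-car features, so the near-perfect score is by construction rather than a claim about the real dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                       # stand-ins for e.g. power_log, Year, Engine
y = X @ np.array([2.0, 0.5, -1.0]) + rng.normal(scale=0.1, size=200)  # known linear signal + small noise

# Hold out 30% of rows; the model never sees them during fitting.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
model = LinearRegression().fit(X_train, y_train)

# R-squared on unseen data is the comparison metric for all models.
print(round(r2_score(y_test, model.predict(X_test)), 3))
```

The same split and scoring call would be reused unchanged for the Ridge, Lasso, tree, and forest models, so their R-squared values are directly comparable.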
data.to_csv('used_cars_b.csv', index=False)
Milestone 1
Context: The used car market is outpacing the new car market in sales. How do we determine how to price old cars for maximum profit?
Objective: Our goal is to design a pricing model that can effectively predict the price of used cars and help the business in devising profitable strategies using differential pricing.
Key Questions: What key features determine car value? How can we use regression to determine maximum profitability?
S.No. : Serial Number
Name : Name of the car which includes Brand name and Model name
Location : The location in which the car is being sold or is available for purchase (Cities)
Year : Manufacturing year of the car
Kilometers_driven : The total kilometers driven in the car by the previous owner(s) in KM.
Fuel_Type : The type of fuel used by the car. (Petrol, Diesel, Electric, CNG, LPG)
Transmission : The type of transmission used by the car. (Automatic / Manual)
Owner : Type of ownership
Mileage : The standard mileage offered by the car company in kmpl or km/kg
Engine : The displacement volume of the engine in CC.
Power : The maximum power of the engine in bhp.
Seats : The number of seats in the car.
New_Price : The price of a new car of the same model in INR 100,000
Price : The price of the used car in INR 100,000 (Target Variable)
Library Import
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
#remove limit from the number of displayed columns and rows.
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 200)
Load Data
data = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/used_cars_a.csv')
from google.colab import drive
drive.mount('/content/drive')
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Exploratory Data Analysis
data.head(10)
| S.No. | Name | Location | Year | Kilometers_Driven | Fuel_Type | Transmission | Owner_Type | Mileage | Engine | Power | Seats | New_price | Price | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | Maruti Wagon R LXI CNG | Mumbai | 2010 | 72000 | CNG | Manual | First | 26.60 | 998.0 | 58.16 | 5.0 | NaN | 1.75 |
| 1 | 1 | Hyundai Creta 1.6 CRDi SX Option | Pune | 2015 | 41000 | Diesel | Manual | First | 19.67 | 1582.0 | 126.20 | 5.0 | NaN | 12.50 |
| 2 | 2 | Honda Jazz V | Chennai | 2011 | 46000 | Petrol | Manual | First | 18.20 | 1199.0 | 88.70 | 5.0 | 8.61 | 4.50 |
| 3 | 3 | Maruti Ertiga VDI | Chennai | 2012 | 87000 | Diesel | Manual | First | 20.77 | 1248.0 | 88.76 | 7.0 | NaN | 6.00 |
| 4 | 4 | Audi A4 New 2.0 TDI Multitronic | Coimbatore | 2013 | 40670 | Diesel | Automatic | Second | 15.20 | 1968.0 | 140.80 | 5.0 | NaN | 17.74 |
| 5 | 5 | Hyundai EON LPG Era Plus Option | Hyderabad | 2012 | 75000 | LPG | Manual | First | 21.10 | 814.0 | 55.20 | 5.0 | NaN | 2.35 |
| 6 | 6 | Nissan Micra Diesel XV | Jaipur | 2013 | 86999 | Diesel | Manual | First | 23.08 | 1461.0 | 63.10 | 5.0 | NaN | 3.50 |
| 7 | 7 | Toyota Innova Crysta 2.8 GX AT 8S | Mumbai | 2016 | 36000 | Diesel | Automatic | First | 11.36 | 2755.0 | 171.50 | 8.0 | 21.00 | 17.50 |
| 8 | 8 | Volkswagen Vento Diesel Comfortline | Pune | 2013 | 64430 | Diesel | Manual | First | 20.54 | 1598.0 | 103.60 | 5.0 | NaN | 5.20 |
| 9 | 9 | Tata Indica Vista Quadrajet LS | Chennai | 2012 | 65932 | Diesel | Manual | Second | 22.30 | 1248.0 | 74.00 | 5.0 | NaN | 1.95 |
data.tail(10)
| S.No. | Name | Location | Year | Kilometers_Driven | Fuel_Type | Transmission | Owner_Type | Mileage | Engine | Power | Seats | New_price | Price | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 7243 | 7243 | Renault Duster 85PS Diesel RxL | Chennai | 2015 | 70000 | Diesel | Manual | First | 19.87 | 1461.0 | 83.8 | 5.0 | NaN | NaN |
| 7244 | 7244 | Chevrolet Aveo 1.4 LS | Pune | 2009 | 45463 | Petrol | Manual | First | 14.49 | 1399.0 | 92.7 | 5.0 | NaN | NaN |
| 7245 | 7245 | Honda Amaze S i-Vtech | Kochi | 2015 | 44776 | Petrol | Manual | First | 18.00 | 1198.0 | 86.7 | 5.0 | NaN | NaN |
| 7246 | 7246 | Hyundai Grand i10 AT Asta | Coimbatore | 2016 | 18242 | Petrol | Automatic | First | 18.90 | 1197.0 | 82.0 | 5.0 | NaN | NaN |
| 7247 | 7247 | Hyundai EON D Lite Plus | Coimbatore | 2015 | 21190 | Petrol | Manual | First | 21.10 | 814.0 | 55.2 | 5.0 | NaN | NaN |
| 7248 | 7248 | Volkswagen Vento Diesel Trendline | Hyderabad | 2011 | 89411 | Diesel | Manual | First | 20.54 | 1598.0 | 103.6 | 5.0 | NaN | NaN |
| 7249 | 7249 | Volkswagen Polo GT TSI | Mumbai | 2015 | 59000 | Petrol | Automatic | First | 17.21 | 1197.0 | 103.6 | 5.0 | NaN | NaN |
| 7250 | 7250 | Nissan Micra Diesel XV | Kolkata | 2012 | 28000 | Diesel | Manual | First | 23.08 | 1461.0 | 63.1 | 5.0 | NaN | NaN |
| 7251 | 7251 | Volkswagen Polo GT TSI | Pune | 2013 | 52262 | Petrol | Automatic | Third | 17.20 | 1197.0 | 103.6 | 5.0 | NaN | NaN |
| 7252 | 7252 | Mercedes-Benz E-Class 2009-2013 E 220 CDI Avan... | Kochi | 2014 | 72443 | Diesel | Automatic | First | 10.00 | 2148.0 | 170.0 | 5.0 | NaN | NaN |
print(data.shape)
(7253, 14)
data.dtypes
S.No. int64 Name object Location object Year int64 Kilometers_Driven int64 Fuel_Type object Transmission object Owner_Type object Mileage float64 Engine float64 Power float64 Seats float64 New_price float64 Price float64 dtype: object
data.describe().T
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| S.No. | 7253.0 | 3626.000000 | 2093.905084 | 0.00 | 1813.000 | 3626.00 | 5439.0000 | 7252.00 |
| Year | 7253.0 | 2013.365366 | 3.254421 | 1996.00 | 2011.000 | 2014.00 | 2016.0000 | 2019.00 |
| Kilometers_Driven | 7253.0 | 58699.063146 | 84427.720583 | 171.00 | 34000.000 | 53416.00 | 73000.0000 | 6500000.00 |
| Mileage | 7170.0 | 18.346527 | 4.157910 | 6.40 | 15.300 | 18.20 | 21.1000 | 33.54 |
| Engine | 7207.0 | 1616.573470 | 595.285137 | 72.00 | 1198.000 | 1493.00 | 1968.0000 | 5998.00 |
| Power | 7078.0 | 112.765214 | 53.493553 | 34.20 | 75.000 | 94.00 | 138.1000 | 616.00 |
| Seats | 7200.0 | 5.280417 | 0.809277 | 2.00 | 5.000 | 5.00 | 5.0000 | 10.00 |
| New_price | 1006.0 | 22.779692 | 27.759344 | 3.91 | 7.885 | 11.57 | 26.0425 | 375.00 |
| Price | 6019.0 | 9.479468 | 11.187917 | 0.44 | 3.500 | 5.64 | 9.9500 | 160.00 |
Year: 1996 - 2019
Kil_Driv: High std may be due to outliers. Max value is an outlier
Mileage: average is ~18
Engine: may have some outliers that skew distritbution
Power: has a high max value compared to 75th quartile
Seat: most cars have 5. min 2 - max 10.
New_price: most values missing; major imputation strategy needed
Price: min price is .44 and max is 160. 50% quartile is 5.64
data.nunique()
S.No. 7253 Name 2041 Location 11 Year 23 Kilometers_Driven 3660 Fuel_Type 5 Transmission 2 Owner_Type 4 Mileage 437 Engine 150 Power 383 Seats 8 New_price 625 Price 1373 dtype: int64
S.No. has 7253 values. We can drop this column as it will not add to analysis.
#droppin S.No. column
data.drop(['S.No.'], axis = 1, inplace = True)
print(data.info())
<class 'pandas.core.frame.DataFrame'> RangeIndex: 7253 entries, 0 to 7252 Data columns (total 13 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Name 7253 non-null object 1 Location 7253 non-null object 2 Year 7253 non-null int64 3 Kilometers_Driven 7253 non-null int64 4 Fuel_Type 7253 non-null object 5 Transmission 7253 non-null object 6 Owner_Type 7253 non-null object 7 Mileage 7170 non-null float64 8 Engine 7207 non-null float64 9 Power 7078 non-null float64 10 Seats 7200 non-null float64 11 New_price 1006 non-null float64 12 Price 6019 non-null float64 dtypes: float64(6), int64(2), object(5) memory usage: 736.8+ KB None
Mileage
Engine
Power
Seats
New_price
Price
# Checking the total number of missing values of each column.
(data.isnull().sum()/data.shape[0])*100
Name                  0.000000
Location              0.000000
Year                  0.000000
Kilometers_Driven     0.000000
Fuel_Type             0.000000
Transmission          0.000000
Owner_Type            0.000000
Mileage               1.144354
Engine                0.634220
Power                 2.412795
Seats                 0.730732
New_price            86.129877
Price                17.013650
dtype: float64
Many values are missing across the columns listed above. To begin the imputation process, we will create new Brand and Model columns by splitting the Name column. This feature-engineering step will let us impute missing values more accurately via groupby.
#Splitting Name column into 'Brand' and 'Model' column.
data[['Brand', 'Model']] = data['Name'].str.split(" ", n=1, expand = True)
print(data.info())
<class 'pandas.core.frame.DataFrame'> RangeIndex: 7253 entries, 0 to 7252 Data columns (total 15 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Name 7253 non-null object 1 Location 7253 non-null object 2 Year 7253 non-null int64 3 Kilometers_Driven 7253 non-null int64 4 Fuel_Type 7253 non-null object 5 Transmission 7253 non-null object 6 Owner_Type 7253 non-null object 7 Mileage 7170 non-null float64 8 Engine 7207 non-null float64 9 Power 7078 non-null float64 10 Seats 7200 non-null float64 11 New_price 1006 non-null float64 12 Price 6019 non-null float64 13 Brand 7253 non-null object 14 Model 7253 non-null object dtypes: float64(6), int64(2), object(7) memory usage: 850.1+ KB None
# Checking missing values in the column 'Mileage'
data["Mileage"].isnull().sum()
83
data["Mileage"] = data.groupby(["Model"])["Mileage"].transform(
lambda x: x.fillna(x.median())
)
data["Mileage"].isnull().sum()
80
data["Mileage"] = data.groupby(["Brand"])["Mileage"].transform(
lambda x: x.fillna(x.median())
)
data["Mileage"].isnull().sum()
1
data['Mileage'] = data['Mileage'].fillna(data['Mileage'].median())
data["Mileage"].isnull().sum()
0
(data.isnull().sum()/data.shape[0])*100
Name                  0.000000
Location              0.000000
Year                  0.000000
Kilometers_Driven     0.000000
Fuel_Type             0.000000
Transmission          0.000000
Owner_Type            0.000000
Mileage               0.000000
Engine                0.634220
Power                 2.412795
Seats                 0.730732
New_price            86.129877
Price                17.013650
Brand                 0.000000
Model                 0.000000
dtype: float64
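The Model → Brand → overall-median cascade used for Mileage is repeated below for Engine, Power, New_price, and Price, so it can be factored into a helper. A minimal sketch on a toy frame (the helper and toy data are illustrative; note it computes group medians from the original column rather than sequentially re-imputing as the notebook does):

```python
import pandas as pd

def impute_by_groups(df: pd.DataFrame, col: str, group_cols: list) -> pd.Series:
    """Fill NaNs in `col` with the group median, trying each grouping in turn,
    then fall back to the overall median."""
    s = df[col].copy()
    for g in group_cols:
        s = s.fillna(df.groupby(g)[col].transform('median'))
    return s.fillna(s.median())

# toy frame standing in for the used-cars data
toy = pd.DataFrame({
    'Brand':   ['Maruti', 'Maruti', 'Honda', 'Honda'],
    'Model':   ['Swift',  'Swift',  'City',  'City'],
    'Mileage': [20.0,     None,     17.0,    None],
})
toy['Mileage'] = impute_by_groups(toy, 'Mileage', ['Model', 'Brand'])
print(toy['Mileage'].tolist())  # NaNs filled with their group medians
```

On the real data the same helper could be applied per column, keeping the imputation logic in one place.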
# Checking missing values in the column 'Engine'
data["Engine"].isnull().sum()
46
data["Engine"] = data.groupby(["Brand"])["Engine"].transform(
lambda x: x.fillna(x.median())
)
data["Engine"].isnull().sum()
0
# Checking missing values in the column 'Power'
data["Power"].isnull().sum()
175
data["Power"] = data.groupby(["Model"])["Power"].transform(
lambda x: x.fillna(x.median())
)
data["Power"].isnull().sum()
136
data["Power"] = data.groupby(["Brand"])["Power"].transform(
lambda x: x.fillna(x.median())
)
data["Power"].isnull().sum()
2
data['Power'] = data['Power'].fillna(data['Power'].median())
data["Power"].isnull().sum()
0
data["Seats"].isnull().sum()
53
data['Seats'] = data['Seats'].fillna(data['Seats'].mean())
data["Seats"].isnull().sum()
0
data["New_price"].isnull().sum()
6247
data["New_price"] = data.groupby(["Name"])["New_price"].transform(
lambda x: x.fillna(x.median())
)
data["New_price"].isnull().sum()
6020
data["New_price"] = data.groupby(["Model"])["New_price"].transform(
lambda x: x.fillna(x.median())
)
data["New_price"].isnull().sum()
6020
data["New_price"] = data.groupby(["Brand"])["New_price"].transform(
lambda x: x.fillna(x.median())
)
data["New_price"].isnull().sum()
162
data['New_price'] = data['New_price'].fillna(data['New_price'].median())
data["New_price"].isnull().sum()
0
data["Price"].isnull().sum()
1234
data["Price"] = data.groupby(["Name"])["Price"].transform(
lambda x: x.fillna(x.median())
)
data["Price"].isnull().sum()
178
data["Price"] = data.groupby(["Brand"])["Price"].transform(
lambda x: x.fillna(x.median())
)
data["Price"].isnull().sum()
2
data['Price'] = data['Price'].fillna(data['Price'].median())
data["Price"].isnull().sum()
0
None found.
data.describe().T
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| Year | 7253.0 | 2013.365366 | 3.254421 | 1996.00 | 2011.00 | 2014.00 | 2016.00 | 2019.00 |
| Kilometers_Driven | 7253.0 | 58699.063146 | 84427.720583 | 171.00 | 34000.00 | 53416.00 | 73000.00 | 6500000.00 |
| Mileage | 7253.0 | 18.327124 | 4.147499 | 6.40 | 15.29 | 18.20 | 21.10 | 33.54 |
| Engine | 7253.0 | 1616.119812 | 594.431397 | 72.00 | 1198.00 | 1493.00 | 1968.00 | 5998.00 |
| Power | 7253.0 | 112.264446 | 53.257490 | 34.20 | 75.00 | 92.70 | 138.03 | 616.00 |
| Seats | 7253.0 | 5.280417 | 0.806315 | 2.00 | 5.00 | 5.00 | 5.00 | 10.00 |
| New_price | 7253.0 | 20.073001 | 22.611560 | 3.91 | 9.51 | 11.15 | 15.38 | 375.00 |
| Price | 7253.0 | 9.348452 | 10.876285 | 0.44 | 3.50 | 5.50 | 9.80 | 160.00 |
Categorical Data Analysis
#explore summary of categorical (object) values
data.describe(include=object)
| Name | Location | Fuel_Type | Transmission | Owner_Type | Brand | Model | |
|---|---|---|---|---|---|---|---|
| count | 7253 | 7253 | 7253 | 7253 | 7253 | 7253 | 7253 |
| unique | 2041 | 11 | 5 | 2 | 4 | 33 | 2041 |
| top | Mahindra XUV500 W8 2WD | Mumbai | Diesel | Manual | First | Maruti | XUV500 W8 2WD |
| freq | 55 | 949 | 3852 | 5204 | 5952 | 1444 | 55 |
#Summary of most car Brands
(data.Brand.value_counts() / len(data))*100
Maruti           19.909003
Hyundai          18.475114
Honda            10.244037
Toyota            6.990211
Mercedes-Benz     5.239211
Volkswagen        5.156487
Ford              4.839377
Mahindra          4.563629
BMW               4.301668
Audi              3.929409
Tata              3.143527
Skoda             2.785054
Renault           2.343858
Chevrolet         2.081897
Nissan            1.613126
Land              0.923756
Jaguar            0.661795
Fiat              0.523921
Mitsubishi        0.496346
Mini              0.427409
Volvo             0.386047
Porsche           0.261961
Jeep              0.261961
Datsun            0.234386
ISUZU             0.041362
Force             0.041362
Isuzu             0.027575
Bentley           0.027575
Smart             0.013787
Ambassador        0.013787
Lamborghini       0.013787
Hindustan         0.013787
OpelCorsa         0.013787
Name: Brand, dtype: float64
#Summary of most car Models
(data.Model.value_counts() / len(data))*100
XUV500 W8 2WD 0.758307
Swift VDI 0.675583
Swift Dzire VDI 0.579071
City 1.5 S MT 0.537709
Swift VDI BSIV 0.510134
...
Beat LT Option 0.013787
Rapid 1.6 MPI AT Elegance Plus 0.013787
EcoSport 1.5 TDCi Ambiente 0.013787
i10 Magna 1.1 iTech SE 0.013787
Elite i20 Magna Plus 0.013787
Name: Model, Length: 2041, dtype: float64
#Summary of most car Name
(data.Name.value_counts() / len(data))*100
Mahindra XUV500 W8 2WD 0.758307
Maruti Swift VDI 0.675583
Maruti Swift Dzire VDI 0.579071
Honda City 1.5 S MT 0.537709
Maruti Swift VDI BSIV 0.510134
...
Chevrolet Beat LT Option 0.013787
Skoda Rapid 1.6 MPI AT Elegance Plus 0.013787
Ford EcoSport 1.5 TDCi Ambiente 0.013787
Hyundai i10 Magna 1.1 iTech SE 0.013787
Hyundai Elite i20 Magna Plus 0.013787
Name: Name, Length: 2041, dtype: float64
#Summary of most car Locations
(data.Location.value_counts() / len(data))*100
Mumbai        13.084241
Hyderabad     12.077761
Coimbatore    10.643872
Kochi         10.643872
Pune          10.547360
Delhi          9.099683
Kolkata        9.016958
Chennai        8.148352
Jaipur         6.879912
Bangalore      6.066455
Ahmedabad      3.791535
Name: Location, dtype: float64
Univariate Analysis - Numerical Data
# Let us write a function that will help us create a boxplot and histogram for any input numerical variable.
# This function takes the numerical column as the input and returns the boxplot and histogram for the variable.
def histogram_boxplot(feature, figsize=(15, 10), bins=None):
    """Boxplot and histogram combined
    feature: 1-d feature array
    figsize: size of figure (default (15, 10))
    bins: number of bins (default None / auto)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,                                      # boxplot on top, histogram below
        sharex=True,                                  # x-axis shared between the subplots
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize,
    )
    sns.boxplot(feature, ax=ax_box2, showmeans=True, color='violet')   # star marks the mean
    sns.distplot(feature, kde=False, ax=ax_hist2, bins=bins)           # histogram
    ax_hist2.axvline(np.mean(feature), color='green', linestyle='--')  # mean line
    ax_hist2.axvline(np.median(feature), color='black', linestyle='-') # median line
Year int64 - numeric dtype, though effectively an ordinal/categorical feature
Kilometers_Driven int64
Mileage float64
Engine float64
Power float64
Seats float64
New_price float64
Price float64
#Plot histogram and box-plot for 'Kilometers_Driven'
histogram_boxplot(data['Kilometers_Driven'])
data.sort_values(["Kilometers_Driven"], ascending = False).head(5)
| Name | Location | Year | Kilometers_Driven | Fuel_Type | Transmission | Owner_Type | Mileage | Engine | Power | Seats | New_price | Price | Brand | Model | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2328 | BMW X5 xDrive 30d M Sport | Chennai | 2017 | 6500000 | Diesel | Automatic | First | 15.97 | 2993.0 | 258.00 | 5.0 | 54.43 | 65.0 | BMW | X5 xDrive 30d M Sport |
| 340 | Skoda Octavia Ambition Plus 2.0 TDI AT | Kolkata | 2013 | 775000 | Diesel | Automatic | First | 19.30 | 1968.0 | 141.00 | 5.0 | 15.38 | 7.5 | Skoda | Octavia Ambition Plus 2.0 TDI AT |
| 1860 | Volkswagen Vento Diesel Highline | Chennai | 2013 | 720000 | Diesel | Manual | First | 20.54 | 1598.0 | 103.60 | 5.0 | 11.26 | 5.9 | Volkswagen | Vento Diesel Highline |
| 358 | Hyundai i10 Magna 1.2 | Chennai | 2009 | 620000 | Petrol | Manual | First | 20.36 | 1197.0 | 78.90 | 5.0 | 9.78 | 2.7 | Hyundai | i10 Magna 1.2 |
| 2823 | Volkswagen Jetta 2013-2015 2.0L TDI Highline AT | Chennai | 2015 | 480000 | Diesel | Automatic | First | 16.96 | 1968.0 | 138.03 | 5.0 | 11.26 | 13.0 | Volkswagen | Jetta 2013-2015 2.0L TDI Highline AT |
data.drop(index = 2328, inplace=True)
#preferred method is to impute median value of cars in similar category.
#Re-plot histogram and box-plot for 'Kilometers_Driven'
histogram_boxplot(data['Kilometers_Driven'])
### We can add a transformed kilometers_driven feature in data
data["kilometers_driven_log"] = np.log(data["Kilometers_Driven"])
#Log transformation of the feature 'Kilometers_Driven'
sns.distplot(np.log(data["Kilometers_Driven"]), axlabel="Log(Kilometers_Driven)")
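The benefit of the log transform can be quantified with skewness. A minimal sketch on synthetic right-skewed data (on the real data the same check would be data['Kilometers_Driven'].skew() before and after np.log):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
# synthetic right-skewed "kilometers driven" values (lognormal stand-in)
km = pd.Series(rng.lognormal(mean=10.8, sigma=0.6, size=5000))

print(f"skew before log: {km.skew():.2f}")          # strongly right-skewed
print(f"skew after log:  {np.log(km).skew():.2f}")  # close to symmetric
```

A skew near 0 after the transform indicates the log feature is far better suited to linear models than the raw values.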
#Plot histogram and box-plot for 'Mileage'
histogram_boxplot(data['Mileage'])
Mileage is roughly symmetric around ~18 kmpl, with two prominent modes (a bimodal shape).
#Plot histogram and box-plot for 'Engine'
histogram_boxplot(data['Engine'])
#Plot histogram and box-plot for 'Power'
histogram_boxplot(data['Power'])
# Log transformation of the feature 'Power'
sns.distplot(np.log(data["Power"]), axlabel="Log(Power)");
### We can add a transformed Power feature in data
data["power_log"] = np.log(data["Power"])
histogram_boxplot(data['New_price'])
histogram_boxplot(data['Price'])
# Log transformation of the feature 'Price'
sns.distplot(np.log(data["Price"]), axlabel="Log(Price)");
# We can Add a transformed Price feature in data
data["price_log"] = np.log(data["Price"])
data.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 7252 entries, 0 to 7252 Data columns (total 18 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Name 7252 non-null object 1 Location 7252 non-null object 2 Year 7252 non-null int64 3 Kilometers_Driven 7252 non-null int64 4 Fuel_Type 7252 non-null object 5 Transmission 7252 non-null object 6 Owner_Type 7252 non-null object 7 Mileage 7252 non-null float64 8 Engine 7252 non-null float64 9 Power 7252 non-null float64 10 Seats 7252 non-null float64 11 New_price 7252 non-null float64 12 Price 7252 non-null float64 13 Brand 7252 non-null object 14 Model 7252 non-null object 15 kilometers_driven_log 7252 non-null float64 16 power_log 7252 non-null float64 17 price_log 7252 non-null float64 dtypes: float64(9), int64(2), object(7) memory usage: 1.1+ MB
Univariate Analysis - Categorical Data
# Let us write a function that will help us create barplots that indicate the percentage for each category.
# This function takes the categorical column as the input and returns the barplot for the variable.
def perc_on_bar(z):
    '''
    z: name of a categorical column in `data`
    (the function won't work if a column is passed in the hue parameter)
    '''
    total = len(data[z])  # number of rows
    plt.figure(figsize=(15, 5))
    ax = sns.countplot(data[z], palette='Paired', order=data[z].value_counts().index)
    for p in ax.patches:
        percentage = '{:.1f}%'.format(100 * p.get_height() / total)  # percentage of each class
        x = p.get_x() + p.get_width() / 2 - 0.05  # x position of the annotation
        y = p.get_y() + p.get_height()            # y position of the annotation
        ax.annotate(percentage, (x, y), size=12)  # annotate the percentage
    plt.show()  # show the plot
Name object
Location object
Fuel_Type object
Transmission object
Owner_Type object
#Bar Plot for 'Brand'
perc_on_bar('Brand')
#Bar Plot for 'Location'
perc_on_bar('Location')
#Bar Plot for 'Fuel_Type'
perc_on_bar('Fuel_Type')
#Bar Plot for 'Transmission'
perc_on_bar('Transmission')
#Bar Plot for 'Owner_Type'
perc_on_bar('Owner_Type')
There are 2 top brands in the Indian market: Maruti and Hyundai.
Location: Mumbai, Hyderabad, and Coimbatore have the most listings.
Most cars run on Diesel or Petrol.
Most cars have manual transmission.
Most cars are first-owner vehicles.
Bivariate Analysis - Scatter Plot
#Scatter plot of 'Year' vs 'Price'
data.plot(x='Year', y='Price', style='o')
#Scatter plot of 'Mileage' vs 'Price'
data.plot(x='Mileage', y='Price', style='o')
#Scatter plot of 'Engine' vs 'Price'
data.plot(x='Engine', y='Price', style='o')
#Scatter plot of 'Kilometers_Driven' vs 'Price'
data.plot(x='Kilometers_Driven', y='Price', style='o')
#Scatter plot of 'Power' vs 'Price'
data.plot(x='Power', y='Price', style='o')
#Scatter plot of 'Seats' vs 'Price'
data.plot(x='Seats', y='Price', style='o')
plt.figure(figsize=(20, 20))
sns.pairplot(data, diag_kind='kde');
import seaborn as sns
sns.lmplot(x='Year', y='Price', hue='Location',
data=data.loc[data['Location'].isin(['Mumbai','Hyderabad','Coimbatore'])],
fit_reg=False)
Bivariate Analysis - Heat Map
#We can include the log transformation values and drop the original skewed data columns
plt.figure(figsize=(12, 7))
sns.heatmap(data.corr(), annot = True, vmin = -1, vmax = 1)
plt.show()
Power, Engine, and Price have strong correlations. As seen in the pair plot, Engine and Power are positively correlated with each other, and each correlates with Price and New_price at roughly 0.66 to 0.77.
Engine and Mileage have a negative relationship: cars with larger engines tend to have lower fuel efficiency (recall that Mileage is kmpl, not distance driven).
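The heat-map reading can also be extracted directly by sorting correlations against the target; on the real data this would be data.corr()['Price'].sort_values(). A minimal sketch on a toy frame:

```python
import pandas as pd

# toy numeric frame standing in for the used-cars data
toy = pd.DataFrame({
    'Power':   [75, 90, 140, 250, 400],
    'Engine':  [1200, 1400, 2000, 3000, 4500],
    'Mileage': [22, 20, 16, 12, 9],
    'Price':   [3.5, 5.0, 10.0, 30.0, 80.0],
})

# correlations with the target, strongest first
corr_with_price = toy.corr()['Price'].drop('Price').sort_values(ascending=False)
print(corr_with_price)  # Power/Engine positive, Mileage negative
```

Sorting this way gives a quick ranking of candidate predictors before any modeling.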
Bivariate Analysis - Box Plot
# Let us write a function that will help us create boxplots w.r.t Price for any input categorical variable.
# This function takes the categorical column as the input and returns the boxplots for the variable.
def boxplot(z):
    plt.figure(figsize=(12, 5))                          # setting figure size
    sns.boxplot(x=z, y=data['Price'])                    # Price split by category
    plt.show()
    plt.figure(figsize=(12, 5))
    plt.title('Without Outliers')
    sns.boxplot(x=z, y=data['Price'], showfliers=False)  # hide the outliers
    plt.show()
boxplot(data['Location'])
boxplot(data['Fuel_Type'])
boxplot(data['Transmission'])
boxplot(data['Owner_Type'])
boxplot(data['Brand'])
We will create a Linear Regression model to see if we can plot the price point predictions.
We will complete Ridge and Lasso Regression tests
We will complete a Decision Tree model
We will complete a Random Forest model
Using the above techniques, we will prepare train and test data sets and evaluate how well each model predicts price.
To assess our models we will review the R-squared results. R-squared measures the strength of the relationship between our model and the dependent variable on a convenient 0 to 100% scale. The model with the best R-squared is what we will use to create our algorithm for future predictions.
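R-squared is defined as 1 - SS_res/SS_tot; a minimal sketch verifying the formula against sklearn's r2_score on illustrative numbers:

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([3.0, 5.5, 9.8, 12.0, 20.0])   # illustrative "actual" prices
y_pred = np.array([3.5, 5.0, 10.5, 11.0, 19.0])  # illustrative predictions

ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
r2_manual = 1 - ss_res / ss_tot

print(round(r2_manual, 4), round(r2_score(y_true, y_pred), 4))  # both values agree
```

The manual calculation and sklearn's metric coincide, which is the property our model comparison below relies on.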
data.to_csv('used_cars_b.csv', index=False)
Milestone 2
#Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
#to scale the data using z-score
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
#algorithms to use
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
#Metrics to evaluate the model
import sklearn.metrics as metrics
#for tuning the model
from sklearn.model_selection import GridSearchCV
#to ignore warnings
import warnings
warnings.filterwarnings('ignore')
cars_data = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/used_cars_b.csv')
# Step-1
X = cars_data.drop(['Name','Price', 'New_price', 'price_log','Kilometers_Driven','Power','Brand', 'Model'],axis=1)
y = cars_data[["price_log", "Price"]]
# Step-2 Use pd.get_dummies(drop_first=True)
X = pd.get_dummies(X,drop_first=True)
# Step-3 Splitting data into training and test set:
X_train, X_test, y_train, y_test =train_test_split(X, y, test_size=0.2, random_state=1)
print(X_train.shape, X_test.shape)
(5801, 24) (1451, 24)
# Let us write a function for calculating r2_score and RMSE on train and test data.
# This function takes a fitted model as input and returns [train_r2, test_r2, train_rmse, test_rmse].
def get_model_score(model, flag=True):
    '''
    model : fitted regressor used to predict on X_train and X_test
    flag  : if True (the default), print the scores as well as returning them
    '''
    # defining an empty list to store train and test results
    score_list = []
    # predictions are on the log scale, so exponentiate back to prices
    pred_train = model.predict(X_train)
    pred_train_ = np.exp(pred_train)
    pred_test = model.predict(X_test)
    pred_test_ = np.exp(pred_test)
    train_r2 = metrics.r2_score(y_train['Price'], pred_train_)
    test_r2 = metrics.r2_score(y_test['Price'], pred_test_)
    train_rmse = metrics.mean_squared_error(y_train['Price'], pred_train_, squared=False)
    test_rmse = metrics.mean_squared_error(y_test['Price'], pred_test_, squared=False)
    # adding all scores to the list
    score_list.extend((train_r2, test_r2, train_rmse, test_rmse))
    # if the flag is True, print the scores
    if flag:
        print("R-square on training set : ", train_r2)
        print("R-square on test set : ", test_r2)
        print("RMSE on training set : ", train_rmse)
        print("RMSE on test set : ", test_rmse)
    # returning the list with train and test scores
    return score_list
Linear Regression can be implemented using:
1) Sklearn: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html
2) Statsmodels: https://www.statsmodels.org/stable/regression.html
# import Linear Regression from sklearn
from sklearn.linear_model import LinearRegression
# Create a linear regression model
lr = LinearRegression()
# Fit linear regression model
lr.fit(X_train,y_train['price_log'])
LinearRegression()
# Get score of the model.
LR_score = get_model_score(lr)
R-square on training set : 0.8007269984569956 R-square on test set : 0.8360390936533956 RMSE on training set : 4.864069633039157 RMSE on test set : 4.331237485296903
Our linear regression model performs reasonably well:
R-squared is ~0.80 on the training set.
R-squared is ~0.84 on the test set.
Building a model using statsmodels
# Import Statsmodels
import statsmodels.api as sm
# Statsmodel api does not add a constant by default. We need to add it explicitly.
x_train = sm.add_constant(X_train)
# Add constant to test data
x_test = sm.add_constant(X_test)
def build_ols_model(train):
    # Create the model (target is the log of Price)
    olsmodel = sm.OLS(y_train["price_log"], train)
    return olsmodel.fit()
# Fit linear model on new dataset
olsmodel1 = build_ols_model(x_train)
print(olsmodel1.summary())
OLS Regression Results
==============================================================================
Dep. Variable: price_log R-squared: 0.868
Model: OLS Adj. R-squared: 0.867
Method: Least Squares F-statistic: 1650.
Date: Fri, 22 Apr 2022 Prob (F-statistic): 0.00
Time: 03:50:58 Log-Likelihood: -1495.9
No. Observations: 5801 AIC: 3040.
Df Residuals: 5777 BIC: 3200.
Df Model: 23
Covariance Type: nonrobust
=============================================================================================
coef std err t P>|t| [0.025 0.975]
---------------------------------------------------------------------------------------------
const -206.2136 3.766 -54.754 0.000 -213.597 -198.830
Year 0.1016 0.002 54.455 0.000 0.098 0.105
Mileage -0.0074 0.002 -4.069 0.000 -0.011 -0.004
Engine 0.0002 1.71e-05 13.085 0.000 0.000 0.000
Seats -0.0151 0.007 -2.230 0.026 -0.028 -0.002
kilometers_driven_log -0.0854 0.008 -11.232 0.000 -0.100 -0.071
power_log 0.9723 0.023 42.786 0.000 0.928 1.017
Location_Bangalore 0.1715 0.027 6.312 0.000 0.118 0.225
Location_Chennai 0.0209 0.026 0.812 0.417 -0.030 0.071
Location_Coimbatore 0.0995 0.025 3.981 0.000 0.050 0.148
Location_Delhi -0.0335 0.025 -1.328 0.184 -0.083 0.016
Location_Hyderabad 0.1324 0.024 5.450 0.000 0.085 0.180
Location_Jaipur -0.0361 0.027 -1.358 0.174 -0.088 0.016
Location_Kochi -0.0078 0.025 -0.313 0.755 -0.057 0.041
Location_Kolkata -0.2025 0.025 -7.957 0.000 -0.252 -0.153
Location_Mumbai -0.0550 0.024 -2.265 0.024 -0.103 -0.007
Location_Pune -0.0401 0.025 -1.613 0.107 -0.089 0.009
Fuel_Type_Diesel 0.0675 0.043 1.569 0.117 -0.017 0.152
Fuel_Type_Electric -2.407e-17 2.28e-17 -1.055 0.292 -6.88e-17 2.07e-17
Fuel_Type_LPG 0.0092 0.108 0.085 0.933 -0.203 0.222
Fuel_Type_Petrol -0.1977 0.044 -4.514 0.000 -0.284 -0.112
Transmission_Manual -0.2941 0.012 -23.545 0.000 -0.319 -0.270
Owner_Type_Fourth & Above 0.2535 0.095 2.656 0.008 0.066 0.441
Owner_Type_Second -0.0704 0.012 -5.731 0.000 -0.094 -0.046
Owner_Type_Third -0.1685 0.031 -5.380 0.000 -0.230 -0.107
==============================================================================
Omnibus: 816.558 Durbin-Watson: 2.012
Prob(Omnibus): 0.000 Jarque-Bera (JB): 5893.393
Skew: -0.463 Prob(JB): 0.00
Kurtosis: 7.850 Cond. No. inf
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 0. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.
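The singular-matrix warning usually means a column of the design matrix is constant or collinear; here the Fuel_Type_Electric dummy is almost entirely zeros in the training split (its coefficient and standard error are ~1e-17). A minimal sketch for flagging constant columns before fitting (toy frame, illustrative column names):

```python
import pandas as pd

# toy design matrix: one constant dummy, standing in for Fuel_Type_Electric
X = pd.DataFrame({
    'Year':                [2011, 2014, 2016, 2013, 2017],
    'Transmission_Manual': [1, 1, 0, 1, 0],
    'Fuel_Type_Electric':  [0, 0, 0, 0, 0],  # constant -> singular design matrix
})

# columns with a single unique value carry no information and break OLS inference
constant_cols = [c for c in X.columns if X[c].nunique() <= 1]
print(constant_cols)  # candidates to drop before re-running OLS
X_clean = X.drop(columns=constant_cols)
```

Dropping such columns (or rare dummy levels) before fitting removes the zero eigenvalue the summary warns about.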
# Retrieve coefficient values and p-values and store them in a dataframe
olsmod = pd.DataFrame(olsmodel1.params, columns=['coef'])
olsmod['pval']=olsmodel1.pvalues
# Sort by p-value (descending) and filter for significant values (pval <= 0.05)
olsmod = olsmod.sort_values(by="pval", ascending=False)
pval_filter = olsmod['pval']<=0.05
olsmod[pval_filter]
| coef | pval | |
|---|---|---|
| Seats | -0.015058 | 2.577327e-02 |
| Location_Mumbai | -0.055016 | 2.355857e-02 |
| Owner_Type_Fourth & Above | 0.253476 | 7.932442e-03 |
| Location_Coimbatore | 0.099462 | 6.946309e-05 |
| Mileage | -0.007399 | 4.774269e-05 |
| Fuel_Type_Petrol | -0.197681 | 6.493069e-06 |
| Owner_Type_Third | -0.168549 | 7.723192e-08 |
| Location_Hyderabad | 0.132405 | 5.254650e-08 |
| Owner_Type_Second | -0.070399 | 1.049853e-08 |
| Location_Bangalore | 0.171503 | 2.954549e-10 |
| Location_Kolkata | -0.202491 | 2.111913e-15 |
| kilometers_driven_log | -0.085415 | 5.641127e-29 |
| Engine | 0.000224 | 1.416485e-38 |
| Transmission_Manual | -0.294059 | 4.029685e-117 |
| Year | 0.101611 | 0.000000e+00 |
| power_log | 0.972344 | 0.000000e+00 |
| const | -206.213558 | 0.000000e+00 |
# Looking at the overall significant variables
pval_filter = olsmod['pval']<=0.05
imp_vars = olsmod[pval_filter].index.tolist()
# Recover the original (un-one-hot-encoded) variable names from the dummy column names
sig_var = []
for col in imp_vars:
    first_part = col.split('_')[0]  # e.g. 'Location_Mumbai' -> 'Location'
    for c in cars_data.columns:
        if first_part in c and c not in sig_var:
            sig_var.append(c)
start = '\033[1m'
end = '\033[95m'
print(start+'Most overall significant variables of LINEAR REGRESSION are '+end,':\n',sig_var)
Most overall significant variables of LINEAR REGRESSION are : ['Seats', 'Location', 'Owner_Type', 'Mileage', 'Fuel_Type', 'kilometers_driven_log', 'Engine', 'Transmission', 'Year', 'power_log']
# import Ridge/ Lasso Regression from sklearn
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
# Create a Ridge regression model
ridge = Ridge()
# Fit Ridge regression model.
ridge.fit(X_train, y_train['price_log'])
Ridge()
# Get score of the model.
Ridge_score = get_model_score(ridge)
R-square on training set : 0.8004773258788276 R-square on test set : 0.8357989638701134 RMSE on training set : 4.867115817522105 RMSE on test set : 4.334407992872399
The Ridge results are essentially identical to plain linear regression (train R-squared ~0.80, test ~0.84). With the default penalty (alpha=1.0) the regularization barely changes the coefficients, so Ridge offers no improvement here.
# Create a Lasso regression model
lasso = Lasso()
# Fit Lasso regression model.
lasso.fit(X_train, y_train['price_log'])
Lasso()
# Get score of the model.
model = get_model_score(lasso)
R-square on training set : -0.16403523418136645 R-square on test set : 0.32370310583063533 RMSE on training set : 11.755973961307594 RMSE on test set : 8.796512788476972
The Lasso results are poor: training R-squared is negative (-0.16) and test R-squared is only 0.32. The default penalty (alpha=1.0) is far too strong for our log-scale target, shrinking most coefficients to zero; the alpha should be tuned rather than left at its default.
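The poor Lasso fit is consistent with the default alpha=1.0 being too strong for a small-scale (log) target. A minimal sketch of choosing alpha by cross-validation with LassoCV instead, on synthetic data rather than the notebook's (the scaling of y mimics a log-price target):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LassoCV

# synthetic data with a small-scale target, mimicking a log-price regression
X, y = make_regression(n_samples=500, n_features=20, noise=5.0, random_state=1)
y = y / 100.0  # shrink the target to a log-like scale

default_lasso = Lasso(alpha=1.0).fit(X, y)          # the notebook's setting
cv_lasso = LassoCV(cv=5, random_state=1).fit(X, y)  # alpha chosen by 5-fold CV

print(f"alpha=1.0 train R^2:  {default_lasso.score(X, y):.3f}")
print(f"CV alpha={cv_lasso.alpha_:.4f} train R^2: {cv_lasso.score(X, y):.3f}")
```

With the target on a small scale, the default alpha zeroes out nearly every coefficient, while the cross-validated alpha recovers the fit; the same tuning would apply to the notebook's price_log target.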
# import Decision tree for Regression from sklearn
#Metrics to evaluate the model
from sklearn import metrics
from sklearn.metrics import confusion_matrix, classification_report, precision_recall_curve,recall_score
from sklearn import tree
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import RandomForestRegressor
# Create a decision tree regression model
dtree = DecisionTreeRegressor(random_state=1)
# Fit decision tree regression model.
dtree.fit(X_train,y_train['price_log'])
DecisionTreeRegressor(random_state=1)
# Get score of the model.
Dtree_model = get_model_score(dtree)
R-square on training set : 0.9999930605171452 R-square on test set : 0.7774826032085173 RMSE on training set : 0.028703780872666153 RMSE on test set : 5.045727905759686
The decision tree massively overfits: training R-squared is ~1.00 (it has essentially memorized the training data) while test R-squared drops to 0.78, below the 0.84 achieved by linear regression.
The training RMSE of ~0.03 confirms the overfitting, and the test RMSE of ~5.0 is the worst of the models so far apart from Lasso. Pruning the tree or using an ensemble should improve generalization.
Print the importance of features in the tree building (the importance of a feature is computed as the (normalized) total reduction of the splitting criterion brought by that feature; for classification trees this is also known as the Gini importance).
print(pd.DataFrame(dtree.feature_importances_, columns = ["Imp"], index = X_train.columns).sort_values(by = 'Imp', ascending = False))
                                Imp
power_log                  0.653973
Year                       0.209695
Engine                     0.040325
Mileage                    0.027651
kilometers_driven_log      0.022871
Seats                      0.009740
Transmission_Manual        0.006370
Location_Kolkata           0.004112
Fuel_Type_Petrol           0.003969
Location_Hyderabad         0.003120
Location_Coimbatore        0.002267
Location_Bangalore         0.002125
Location_Mumbai            0.002103
Owner_Type_Second          0.002010
Location_Delhi             0.001703
Owner_Type_Third           0.001638
Location_Kochi             0.001340
Location_Jaipur            0.001243
Location_Chennai           0.001210
Location_Pune              0.000920
Fuel_Type_Diesel           0.000861
Owner_Type_Fourth & Above  0.000671
Fuel_Type_LPG              0.000085
Fuel_Type_Electric         0.000000
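The importance definition above (normalized total reduction of the splitting criterion) can be illustrated on a toy regressor: the importances sum to 1 and the dominant feature gets most of the weight. A minimal sketch with synthetic data:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
# feature 0 dominates the target; feature 1 is weak; feature 2 is noise
y = 5 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=200)

toy_tree = DecisionTreeRegressor(max_depth=4, random_state=1).fit(X, y)
imp = toy_tree.feature_importances_
print(imp)  # importances sum to 1; feature 0 receives most of the weight
```

This mirrors what we see above, where power_log and Year absorb most of the variance reduction.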
Note that New_price was excluded from the model features: ~86% of its values were missing and had to be imputed by Brand, Model, and Year, so including it would mostly inject imputation noise rather than real signal.
#Random Forest imported in LN 22 of "Decision Tree"
# Create a Randomforest regression model
rf_estimator = RandomForestRegressor(random_state=1)
# Fit Randomforest regression model.
rf_estimator.fit(X_train,y_train['price_log'])
RandomForestRegressor(random_state=1)
# Get score of the model.
RF_model = get_model_score(rf_estimator)
R-square on training set : 0.967849517478101 R-square on test set : 0.8904270955100108 RMSE on training set : 1.9537513901182544 RMSE on test set : 3.5407350395734185
The R-squared values from our Random Forest show great improvement over the Decision Tree model. The training set is still overfitted at 0.97, but the test set improves to 0.89, the best so far. This may be the preferred model over the Decision Tree.
Let us review feature importance and compare to the Decision Tree.
# Print important features
print(pd.DataFrame(rf_estimator.feature_importances_, columns = ["Imp"], index = X_train.columns).sort_values(by = 'Imp', ascending = False))
Imp power_log 0.656957 Year 0.212186 Engine 0.035721 Mileage 0.025428 kilometers_driven_log 0.025070 Seats 0.007448 Transmission_Manual 0.006326 Location_Kolkata 0.004381 Location_Hyderabad 0.003414 Fuel_Type_Diesel 0.002958 Location_Coimbatore 0.002532 Owner_Type_Second 0.002129 Fuel_Type_Petrol 0.002084 Location_Mumbai 0.002017 Location_Delhi 0.001793 Location_Bangalore 0.001683 Location_Pune 0.001542 Location_Jaipur 0.001513 Owner_Type_Third 0.001483 Location_Kochi 0.001402 Location_Chennai 0.001220 Owner_Type_Fourth & Above 0.000628 Fuel_Type_LPG 0.000084 Fuel_Type_Electric 0.000000
As with the decision tree, power_log and Year dominate the importances. power_log, Engine, and kilometers_driven_log all showed genuine correlation with Price in our bivariate analysis, which supports these rankings. New_price remains excluded from the features because most of its values were imputed.
Hyperparameter Tuning: Decision Tree
#Choose the type of estimator.
dtree_tuned = DecisionTreeRegressor(random_state=1)
# Grid of parameters to choose from.
# Check the documentation for all the parameters that the model takes and experiment with them.
parameters = {'max_depth': [None],
'criterion': ['squared_error','friedman_mse'],
'min_samples_leaf': [1, 3, 5],
'max_leaf_nodes' : [2, 5, 7] + [None],
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.r2_score)
# Run the grid search
grid_obj = GridSearchCV(dtree_tuned, parameters, scoring=scorer, cv=5)
grid_obj = grid_obj.fit(X_train, y_train['price_log'])
# Set dtree_tuned to the estimator with the best combination of parameters
dtree_tuned = grid_obj.best_estimator_
# Fit the best estimator to the full training data
dtree_tuned.fit(X_train, y_train['price_log'])
DecisionTreeRegressor(criterion='friedman_mse', min_samples_leaf=5,
random_state=1)
# Get score of the dtree_tuned
model = get_model_score(dtree_tuned)
R-square on training set : 0.9153709854962969
R-square on test set : 0.815267572715803
RMSE on training set : 3.169824327552197
RMSE on test set : 4.597411843229507
The dtree_tuned model shows a dramatic improvement over its untuned predecessor: test R-square rises from ~0.78 to ~0.82, and the train-test gap shrinks considerably. The model is still somewhat overfit, but far less so than before.
Hyperparameter Tuning: Random Forest
# Choose the type of Regressor.
rfr_tuned = RandomForestRegressor(random_state=1)
# Define the parameters for Grid to choose from
parameters = {'max_depth': [None],
'criterion': ['squared_error','friedman_mse'],
'min_samples_leaf': [1, 3, 5, 7],
'max_leaf_nodes' : [2, 5, 7] + [None],
}
# Check the documentation for all the parameters the model takes and experiment with them
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.r2_score)
# Run the grid search
grid_obj = GridSearchCV(rfr_tuned, parameters, scoring=scorer, cv=5)
grid_obj = grid_obj.fit(X_train, y_train['price_log'])
# Set rfr_tuned to the estimator with the best combination of parameters
rfr_tuned = grid_obj.best_estimator_
# Fit the best algorithm to the data.
rfr_tuned.fit(X_train,y_train['price_log'])
RandomForestRegressor(criterion='friedman_mse', random_state=1)
# Get score of the model.
model = get_model_score(rfr_tuned)
R-square on training set : 0.9674410519270459
R-square on test set : 0.8891069645481378
RMSE on training set : 1.9661232320784119
RMSE on test set : 3.5620005129842527
The R-square values of the tuned Random Forest are essentially identical to the untuned model's (test R-square 0.889 vs. 0.890), which suggests the grid search selected parameters very close to the defaults.
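When tuning matches the defaults like this, the grid may simply have been too narrow. `RandomizedSearchCV` samples a wider parameter space at a fixed budget. The sketch below runs on synthetic data, and the parameter ranges are illustrative choices, not values from this notebook.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the notebook's training data
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=1)

# A wider, illustrative search space than the grid above
param_dist = {'max_depth': [None, 5, 10, 20],
              'min_samples_leaf': [1, 3, 5, 7],
              'max_features': ['sqrt', None],
              'n_estimators': [50, 100, 200]}

# Sample 8 random combinations instead of exhaustively trying all of them
search = RandomizedSearchCV(RandomForestRegressor(random_state=1),
                            param_dist, n_iter=8, scoring='r2',
                            cv=3, random_state=1)
search.fit(X, y)
```

With the same number of fits, a randomized search covers far more distinct values per parameter than a small exhaustive grid, which makes it more likely to beat the defaults.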
Let us review feature importance and compare to the original Random Forest features of importance.
# Print important features of tuned random forest
print(pd.DataFrame(rfr_tuned.feature_importances_, columns = ["Imp"], index = X_train.columns).sort_values(by = 'Imp', ascending = False))
                                Imp
power_log                  0.656930
Year                       0.212301
Engine                     0.035817
Mileage                    0.025490
kilometers_driven_log      0.024923
Seats                      0.007442
Transmission_Manual        0.006314
Location_Kolkata           0.004430
Location_Hyderabad         0.003454
Location_Coimbatore        0.002612
Fuel_Type_Diesel           0.002514
Fuel_Type_Petrol           0.002444
Owner_Type_Second          0.002102
Location_Mumbai            0.002031
Location_Delhi             0.001771
Location_Bangalore         0.001673
Location_Pune              0.001545
Location_Jaipur            0.001521
Owner_Type_Third           0.001487
Location_Kochi             0.001294
Location_Chennai           0.001197
Owner_Type_Fourth & Above  0.000624
Fuel_Type_LPG              0.000085
Fuel_Type_Electric         0.000000
The dominant features across the tree-based models are power_log, Year, and Engine, in varying proportions; power_log and Year alone account for roughly 87% of the total importance.
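The importance table is easier to digest as a horizontal bar chart. The sketch below hard-codes the top five values from the tuned forest for a self-contained example; in the notebook you would build the Series from `rfr_tuned.feature_importances_` and `X_train.columns` directly.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so this runs headlessly
import matplotlib.pyplot as plt
import pandas as pd

# Top-five importances copied from the tuned Random Forest output above
imp = pd.Series({'power_log': 0.657,
                 'Year': 0.212,
                 'Engine': 0.036,
                 'Mileage': 0.025,
                 'kilometers_driven_log': 0.025})

# Horizontal bars, smallest at the bottom, largest at the top
imp.sort_values().plot(kind='barh', title='Random Forest feature importance')
plt.tight_layout()
plt.savefig('rf_importance.png')
```

A chart like this makes it immediately obvious how steep the drop-off is after power_log and Year.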
All Models Comparison
# defining list of models you have trained
models = [lr, ridge, lasso, dtree, rf_estimator, dtree_tuned, rfr_tuned]
# defining empty lists to add train and test results
r2_train = []
r2_test = []
rmse_train= []
rmse_test= []
# looping through all the models to get the rmse and r2 scores
for model in models:
    # collect r2 and RMSE scores (train and test) without printing
    j = get_model_score(model, False)
    r2_train.append(j[0])
    r2_test.append(j[1])
    rmse_train.append(j[2])
    rmse_test.append(j[3])
comparison_frame = pd.DataFrame({'Model':['Linear Regression', 'Ridge Regression', 'Lasso Regression', 'Decision Tree','Random Forest','Decision Tree Tuned','Random Forest Tuned'],
'Train_r2': r2_train,'Test_r2': r2_test,
'Train_RMSE':rmse_train,'Test_RMSE':rmse_test})
comparison_frame
| | Model | Train_r2 | Test_r2 | Train_RMSE | Test_RMSE |
|---|---|---|---|---|---|
| 0 | Linear Regression | 0.800727 | 0.836039 | 4.864070 | 4.331237 |
| 1 | Ridge Regression | 0.800477 | 0.835799 | 4.867116 | 4.334408 |
| 2 | Lasso Regression | -0.164035 | 0.323703 | 11.755974 | 8.796513 |
| 3 | Decision Tree | 0.999993 | 0.777483 | 0.028704 | 5.045728 |
| 4 | Random Forest | 0.967850 | 0.890427 | 1.953751 | 3.540735 |
| 5 | Decision Tree Tuned | 0.915371 | 0.815268 | 3.169824 | 4.597412 |
| 6 | Random Forest Tuned | 0.967441 | 0.889107 | 1.966123 | 3.562001 |
Out of the 7 predictive models, the Random Forest performed best, with a test R-square of ~0.89 and the lowest test RMSE (3.54); the tuned Random Forest is a very close second.
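One practical detail when deploying the chosen model: because it was trained on price_log, raw predictions are in log space and must be exponentiated back to INR lakh before quoting prices. The sketch below uses toy data and assumes price_log was produced with a natural log (`np.log`), which should be verified against the notebook's preprocessing.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Toy stand-in: positive "prices" in INR lakh
X, price = make_regression(n_samples=200, n_features=4, random_state=1)
price = np.abs(price) + 1.0

# Mirror the notebook: train on the log of the target
price_log = np.log(price)
rf = RandomForestRegressor(n_estimators=50, random_state=1).fit(X, price_log)

# Predictions come back in log space; invert the transform to get lakh
pred_lakh = np.exp(rf.predict(X[:5]))
```

Note that `exp(mean(log(price)))` is a geometric rather than arithmetic mean, so back-transformed predictions tend to sit slightly below the arithmetic average price; a smearing correction can be applied if unbiased lakh-scale estimates matter for the pricing strategy.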